Abstract. Given natural limitations on the length of DNA sequences, designing phylogenetic
reconstruction methods which are reliable under limited information is a crucial
endeavor. There have been two approaches to this problem: reconstructing partial but
reliable information about the tree ([18], [7], [13], [5]), and reaching "deeper" in the tree
through reconstruction of ancestral sequences. In the latter category, [6] settled an
important conjecture of M. Steel, showing that, under the CFN model of evolution, all
trees on n leaves with edge lengths bounded by the Ising model phase transition can
be recovered with high probability from genomes of length $O(\log n)$ with a polynomial
time algorithm. Their methods had a running time of $O(n^{10})$.

Here we enhance our methods from [5] with the learning of ancestral sequences and 
provide an algorithm for reconstructing a sub-forest of the tree which is reliable given 
available data, without requiring a-priori known bounds on the edge lengths of the 
tree. Our methods are based on an intuitive minimum spanning tree approach and 
run in $O(n^3)$ time. For the case of full reconstruction of trees with edges under the
phase transition, we maintain the same sequence length requirements as [6], despite
the considerably faster running time. 
1. Introduction

Reconstructing the pattern of common ancestry among species is a central problem in 
evolutionary biology. This pattern is most commonly represented as a phylogenetic tree: 
a rooted tree with leaf-set S(T) labeled by the species (or taxa) in X. Furthermore, it is 
generally assumed that phylogenies are binary: every speciation event is a divergence of 
two species from one common ancestor. Therefore the nodes V(T) of the tree are either
leaves corresponding to extant species in the set X, or internal nodes of degree three, 
corresponding to the ancestral species at each speciation event. 

The phylogeny reconstruction problem is to discern the tree that accurately represents 
the evolutionary history of the taxa X. It is natural to identify each taxon with its 
genetic sequence and exploit molecular level differences between species to recover the 
phylogeny. To render the reconstruction problem tractable, it is commonly assumed 
that genetic sequences are correctly aligned and that sequences at the leaves are evolved 
from a root sequence according to an evolutionary Markov process on the tree: each 
edge e in the tree, corresponding to an ancestral "divergence event", is equipped with 
mutation probability matrix P(e). The sites of the sequences are evolved identically and 
independently according to these mutation probabilities. 

The amount of disagreement between two sequences will then, depending on the under- 
lying model of evolution, provide a scalar distance measure between the two sequences. 
As we detail in the next section, under suitable independence assumptions these dis- 
tances are additive: the distances between the leaf taxa X will correspond to the graph 
distance Dx given by edge lengths L on the tree T. 

Most phylogeny reconstruction algorithms rely on estimating pairwise distances be- 
tween taxa from the available genetic sequences and, in turn, using these estimates to 
recover topological information. Intuitively, reconstruction is achieved by piecing to- 
gether topologies of smaller sub-trees which have a uniquely defined supertree. For 
instance, it is a fundamental result that a binary phylogenetic tree can be correctly re- 
covered from its quartets: topologies describing the ancestral relations between subsets 
$X' \subset X$, $|X'| = 4$.

The main difficulty in the reconstruction of full phylogenies lies in the correct iden- 
tification of short and deep divergence events [T^ [S] . Intuitively, a divergence event is 
correctly recovered when the amount of mutation it induces is not drowned by mutation 
along the evolutionary paths leading away from it. Like any statistical estimator, the 
accuracy of evolutionary distance estimates is increasing in the amount of available data 
(length of the genetic sequences), but naturally decays as the variance in the system 
grows. In our case, longer biological distances have higher variance and are harder to 
estimate correctly. The probability of correctly resolving an ancestral divergence event is 
therefore naturally decreasing in the length of the pairwise distances used in its discovery, 
and increasing in the length of its corresponding edge. 

It has been shown previously that, given upper and lower bounds on the mutation 
probabilities along each edge of T, $N = \log^{O(1)} n$ sites will suffice to reconstruct T
correctly for almost all topologies T. Intuitively, this approach relies on the fact that 
most phylogenies are not very deep: all internal nodes v have enough descendants among 
the observable present species that are a bounded number of edges away, and whose
observed character sequences thus provide enough information to resolve the topological 
structure of T around v. 

However, for topologies containing very deep nodes (such as in perfectly balanced 
trees), the reconstruction requires accurate estimation of distances between taxa that 
are "far-apart", therefore necessitating longer character sequences. Indeed, [16] shows 
that in the case of perfectly balanced binary trees, $N = n^{O(1)}$ is required for accurate
reconstruction. 

Recently there has been a growing trend to design algorithms that do not always attempt
to recover a full tree ([18], [7], [13] and our own [5]), but only provide topological
information that can be reliably extracted from the data, generally in the form of a 
forest of edge-disjoint subtrees of the original tree. This is a very important feature of 
reconstruction algorithms, as most real data-sets are not sufficient for recovering a full 
topology, and therefore any algorithm designed to return a full tree is bound to also give 
possibly incorrect information. 

Another possible source of improvement in the area involves the reconstruction of 
internal genomes, which therefore provides pairwise distance estimates between inter- 
nal nodes, allowing us to reach "deeper" in the topology and reconstruct from shorter 
distances. This method was introduced by Mossel [17] for the CFN model of evolution.
He showed that for any fixed topology on n leaves with edge lengths less than
$\lambda_0 = \log(2)/4$, the so-called "phase transition of the Ising model on trees", arbitrarily
deep internal sequences can be recovered with bounded probability of error. This implies
that leaf sequences of length $O(\log n)$ suffice to distinguish between all perfectly
balanced phylogenies on n leaves. A simple information-theoretic argument shows that 
this bound is tight. 

Mossel's techniques were then used by [6] in the context of phylogeny reconstruction.
Given a lower bound f and an upper bound $g < \lambda_0$ for the edge lengths of T, [6]
show that the full topology can be recovered from sequences of length $O(\log n)$, thereby
settling an important conjecture of M. Steel. Their algorithm has a worst case running 
time of $O(n^{10})$. In [8] it is shown that N must grow at least as fast as $O(\log n)$, and
therefore the results of [6] are asymptotically optimal. The results of [6] have been 
partially extended by Roch [19] to a general time-reversible model, with worse but still 
sub-polynomial sequence length requirements. 

Here, we will present a relatively simple algorithm which combines our approach in [5]
with the reconstruction of ancestral sequences as detailed in [17]. The ability to learn
ancestral sequences is central to the success of the methods in [6] and subsequently
our methods, as it allows the reconstruction of the model tree topology T by piecing 
together quartet topologies of bounded diameter on its internal nodes. By contrast, 
previous algorithms that have only looked at quartets on the leaves of T have achieved 
strictly weaker results. 

For trees with edges under the phase transition we achieve full topology reconstruction
with the same edge length requirements as [6]. Our algorithm relies on an intuitive
minimum spanning tree approach: we progress recursively by growing an edge-disjoint 
sub-forest of T. The algorithm halts when no further progress can be made reliably,
meaning that all edges that could be added are either too short to be resolved accu- 
rately, or they violate the phase transition bound, therefore preventing further reliable 
reconstruction of ancestral genomes. 
Our contributions here are threefold: 

• We reduce the worst case running time to $O(n^3)$, thus matching that of much
simpler phylogeny reconstruction algorithms, such as Neighbor-Joining.

• In the case when full reconstruction is not possible with the available data, we 
return reliable partial information in the form of an edge-disjoint sub-forest of 
T. 

• We eliminate the need for a-priori knowledge of the edge length bounds f and
g. Rather, we infer an edge length tolerance interval from the length of the 
available genetic sequences and reconstruct pieces of the tree with edges within 
this interval. 

It is worth noting that our edge length tolerance interval can in fact be controlled by 
the user. Increasing it can potentially result in a larger output forest, but will trade 
off against the expected accuracy of this output. We also note that our method implies 
similar results for all group based models of evolution where the character alphabet is 
a group G admitting a non-trivial morphism $G \to \mathbb{Z}_2$. This class of models includes,
among others, the well known Kimura 3ST [13] and Jukes-Cantor models. We elaborate 
on this technical point in Appendix B.

2. Background on phylogeny reconstruction 

In this work, we concentrate on the Cavender-Farris-Neyman (CFN) 2-state model 
of evolution ([3], [10]): our genetic sequences are bit strings of some length N and the 
probability of mutation p(e) along an edge e of the tree does not depend on the starting 
state. We denote the i-th entry of the sequence corresponding to taxon $a \in X$ as $\chi_i(a)$.
The vectors $\chi_i(\cdot)$ are also known as characters of the set X.

For each position i, the character values at the nodes of T mutate independently along
each edge $e = (u, v) \in E(T)$, starting from a uniform distribution at the root node $\rho$,
according to the symmetric transition matrix

$$M(e) = \exp(L(e)R) = \begin{pmatrix} 1 - p(e) & p(e) \\ p(e) & 1 - p(e) \end{pmatrix},$$

where R is the symmetric rate matrix $\begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$, such that $L(e) = -\log(1 - 2p(e))/2$,
$p(e) = P(\chi_i(u) \neq \chi_i(v))$ and the distribution of character states at any node is also
uniform.
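
As a hedged illustration of the CFN edge process just described, the sketch below converts between an edge length L(e) and its mutation probability p(e), and evolves a ±1 sequence along a single edge. The function names are hypothetical and are not taken from the paper; the only thing assumed from the text is the relation L(e) = -log(1 - 2p(e))/2.

```python
import math
import random


def mutation_probability(length):
    """p(e) for a CFN edge of length L(e): p = (1 - exp(-2 L)) / 2."""
    return (1.0 - math.exp(-2.0 * length)) / 2.0


def edge_length(p):
    """Inverse map L = -log(1 - 2 p) / 2, defined for 0 <= p < 1/2."""
    return -math.log(1.0 - 2.0 * p) / 2.0


def evolve_along_edge(parent, length, rng):
    """Flip each +/-1 site independently with probability p(e)."""
    p = mutation_probability(length)
    return [-s if rng.random() < p else s for s in parent]


# Illustrative run: a uniform root sequence evolved along one edge of length 0.1.
rng = random.Random(0)
root = [rng.choice((-1, 1)) for _ in range(10000)]
child = evolve_along_edge(root, 0.1, rng)
observed = sum(a != b for a, b in zip(root, child)) / len(root)
```

With enough sites, `observed` should be close to `mutation_probability(0.1)`, which is what makes the edge length recoverable from sequence disagreement.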

The topology T, together with the edge lengths L, defines a joint probability distribution
$P^*_{T,L}$ on the character values at the nodes of T, and the $\chi_i(\cdot)$ are i.i.d. samples from this
probability distribution. Note that the values of $\chi_i(u)$ are not known for ancestral nodes
$u \in V(T) \setminus X$. We therefore define $P_{T,L}$ to be the marginal distribution of $P^*_{T,L}$ at the
leaves X. The observed character sequences $\chi_i(X)$ are then i.i.d. samples from $P_{T,L}$.
Let $\Omega$ be the set of all possible binary topologies on X. The problem of phylogeny
reconstruction is then equivalent to finding an algorithm, or estimator, $A : \{\pm 1\}^{|X| \times N} \to \Omega$,
such that the probability that $A(\chi_1, \ldots, \chi_N) = T$ is maximized. As with most
estimation problems, the central question then becomes: how many samples do we need
in order to achieve accurate reconstruction of the underlying tree? 

Note: In the case of the CFN model we can only recover the "unrooted" topology
T, but not the location of the earliest species in T, i.e. its root. This is because CFN is
a reversible model of evolution, meaning that the probability distribution $P^*_{T,L}$ does not
depend on the location of the root $\rho$, and therefore neither does $P_{T,L}$. See [20] for more
details on Markov models of evolution. 

For two uniform Bernoulli variables u, v, sharing a joint distribution P, let us define
$D_P(u, v) = -\log(1 - 2P[u \neq v])/2$. Note the similarity to the definition of the edge
lengths under the CFN model. It is easy to check that for three uniform Bernoulli
variables $v_i$, with $i \in \{1, 2, 3\}$, such that $(v_1 \perp v_3 \mid v_2)$, the following holds:

$$D(v_1, v_3) = D(v_1, v_2) + D(v_2, v_3).$$

Here $(v_1 \perp v_3 \mid v_2)$ means that $v_1$ and $v_3$ are independent conditioned on the value of $v_2$.
In other words, given the Markov property of the CFN model (see [20]), for two nodes
$a, b \in V(T)$ joined by a path p, we have the following relationship:

$$D(\chi(a), \chi(b)) = \sum_{e \in p} L(e).$$

Here and in the remainder of our paper, D is the distance given by the joint probability
distribution $P^*_{T,L}$.
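
The additivity of D along a conditionally independent chain can be verified in a few lines. The shorthand $p_{12}, p_{23}, p_{13}$ for the three disagreement probabilities is introduced here only for illustration. Because all three variables are uniform and $v_1, v_3$ are independent given $v_2$, the endpoints disagree exactly when one of the two steps flips and the other does not:

```latex
% Shorthand (illustrative): p_{ij} = P[v_i \neq v_j].
\begin{align*}
p_{13} &= p_{12}(1 - p_{23}) + (1 - p_{12})\,p_{23}, \\
1 - 2p_{13} &= (1 - 2p_{12})(1 - 2p_{23}), \\
-\tfrac{1}{2}\log(1 - 2p_{13}) &= -\tfrac{1}{2}\log(1 - 2p_{12}) - \tfrac{1}{2}\log(1 - 2p_{23}).
\end{align*}
```

The last line is the three-variable additivity identity; applying it repeatedly along a path gives the sum of edge lengths.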

This implies that knowing the joint probability distribution of character values at 
pairs of leaves will provide us with the distance between the two leaves according to the 
edge lengths defined above, which will in turn provide the topology T and the individual 
edge lengths L. In practice we will, of course, not know D precisely, but we will be able 
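to estimate it from the observed character values $\chi_i(\cdot)$, which are i.i.d. samples from the
marginal $P_{T,L}$. For $a, b \in X$, define

$$\hat{D}(a, b) = -\frac{1}{2} \log\Big(1 - \frac{2}{N} \sum_{i} \mathbb{1}[\chi_i(a) \neq \chi_i(b)]\Big) = -\frac{1}{2} \log\Big(\frac{1}{N} \sum_{i} \chi_i(a)\chi_i(b)\Big).$$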
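
A minimal sketch of this empirical distance for two ±1 sequences follows. The function name is hypothetical, and returning infinity when the empirical correlation is not positive is an assumption made here so that the logarithm is always defined; the paper does not specify that case.

```python
import math


def empirical_distance(xa, xb):
    """Empirical CFN distance -0.5 * log(mean of xa[i] * xb[i]) for +/-1 sequences."""
    if len(xa) != len(xb) or not xa:
        raise ValueError("sequences must be non-empty and of equal length")
    correlation = sum(a * b for a, b in zip(xa, xb)) / len(xa)
    if correlation <= 0.0:
        # Assumption of this sketch: treat saturated pairs as infinitely far apart.
        return math.inf
    return -0.5 * math.log(correlation)


# Two sequences of length 100 that differ at 10 sites: correlation = 0.8.
xa = [1] * 100
xb = [1] * 90 + [-1] * 10
d_hat = empirical_distance(xa, xb)
```

The product form is convenient because, for ±1 values, the product of two sites is 1 when they agree and -1 when they differ.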

Consider the simplest example of reconstructing a quartet: a binary topology Q on 4 
leaves X = {a, b, c, d} (there is only one possible topology on 3 leaves). There are three 
possibilities, each corresponding to a pairing of the four taxa. We let $Q = (a, b \mid c, d)$
encode the case when the taxa a, b are separated from the taxa c, d by an edge e. In the
case Q is indeed the correct topology on the four taxa, the true pairwise distance matrix
D satisfies the so-called four point condition:

$$D(a, b) + D(c, d) < D(a, c) + D(b, d) = D(b, c) + D(a, d),$$

and moreover $2L(e) = D(a, c) + D(b, d) - D(a, b) - D(c, d)$, where $D(a, b) = D(\chi(a), \chi(b))$
for ease of notation.
Given the approximate distance $\hat{D}$, the procedure FPM, as in four point method,
gives us a way to resolve the topology of the quartet a, b, c, d, while the procedure ME,
as in middle edge, gives us a way to estimate internal edge lengths. We borrow some of 
our notation from [9]. 

Definition 2.1. Suppose $\hat{D}(a, b) + \hat{D}(c, d) < \hat{D}(a, c) + \hat{D}(b, d) < \hat{D}(b, c) + \hat{D}(a, d)$. Then
let

$$FPM(\hat{D}; a, b, c, d) = (a, b \mid c, d)$$ and

$$ME(\hat{D}; (a, b \mid c, d)) = (\hat{D}(a, c) + \hat{D}(b, d) + \hat{D}(b, c) + \hat{D}(a, d) - 2\hat{D}(a, b) - 2\hat{D}(c, d))/4.$$

We observe that as long as $|\hat{D}(i, j) - D(i, j)| < \epsilon/2 < L(e)/2$, for $i, j \in \{a, b, c, d\}$,
then FPM recovers the correct quartet topology, and that $|ME(\hat{D}, Q) - L(e)| < \epsilon$, where
$Q = (a, b \mid c, d)$.
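
The two procedures can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names and the dictionary-based distance lookup are hypothetical, and FPM is written as "pick the pairing with the smallest sum", which agrees with Definition 2.1 whenever the ordering assumed there holds.

```python
def make_distance(table):
    """Wrap a dict keyed by unordered pairs as a symmetric lookup D(x, y)."""
    return lambda x, y: table[frozenset((x, y))]


def fpm(D, a, b, c, d):
    """Four point method: return the pairing whose two distances have the smallest sum."""
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(pairings, key=lambda q: D(*q[0]) + D(*q[1]))


def middle_edge(D, quartet):
    """Middle edge estimate for the quartet ((a, b), (c, d))."""
    (a, b), (c, d) = quartet
    cross = D(a, c) + D(b, d) + D(b, c) + D(a, d)
    return (cross - 2 * D(a, b) - 2 * D(c, d)) / 4


# Additive distances from a quartet with pendant edges a:0.10, b:0.20, c:0.15, d:0.25
# and a middle edge of length 0.30 separating {a, b} from {c, d}.
table = {
    frozenset(pair): value
    for pair, value in {
        ("a", "b"): 0.30, ("c", "d"): 0.40,
        ("a", "c"): 0.55, ("b", "d"): 0.75,
        ("a", "d"): 0.65, ("b", "c"): 0.65,
    }.items()
}
D = make_distance(table)
quartet = fpm(D, "a", "b", "c", "d")
estimate = middle_edge(D, quartet)
```

On exact additive distances the middle edge estimate equals the true middle edge length; with estimated distances the error is controlled as stated in the observation above.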

It is a fundamental fact in phylogenetics that the topology of the entire tree can 
be recovered from the topologies of its quartets (see [20] for details). The following 
proposition is the first step towards giving lower bounds on the number of samples N 
that ensure proper reconstruction. Its proof is implied by the proof of Theorem 8 in [9]
and has been proved in several other publications. 

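Theorem 2.2. [9] Let u, v be uniform binary random variables with $P(u \neq v) < y$.
Given N samples of u, v and the associated empirical distance $\hat{D}$, then

$$P[\hat{D}(u, v) > D(u, v) + \epsilon/2] < 1.5 \exp\Big(-\frac{(1 - \sqrt{1 - 2z})^2 (1 - 2y)^2 N}{8}\Big),$$

$$P[\hat{D}(u, v) < D(u, v) - \epsilon/2] < 1.5 \exp\Big(-\frac{(1 - \sqrt{1 - 2z})^2 (1 - 2y)^2 N}{8}\Big),$$

where $\epsilon = -\log(1 - 2z)/2$.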

Theorem 2.2 has the following easy but important corollary: in a nutshell, given a
fixed y and $M = -\log(1 - 2y)/2$, distances larger than M will, with high probability,
be "estimated" as longer than $M - \epsilon$. The proof comes from the second inequality of
Theorem 2.2 via a standard coupling argument, and is therefore omitted.

Corollary 2.3. Let u, v be uniform binary random variables with $P(u \neq v) > y$. Given
N samples of u, v and the associated empirical distance $\hat{D}$, then

$$P[\hat{D}(u, v) < M - \epsilon/2] < 1.5 \exp\Big(-\frac{(1 - \sqrt{1 - 2z})^2 (1 - 2y)^2 N}{8}\Big),$$

where $\epsilon = -\log(1 - 2z)/2$ and $M = -\log(1 - 2y)/2$.

In general, when an estimator $\hat{D}$ of the quantity D satisfies

$$|\hat{D} - D| < \epsilon/2 \text{ when } D < M,$$
$$\hat{D} > M - \epsilon/2 \text{ when } D > M,$$

we say that $\hat{D}$ is an $(M, \epsilon/2)$-estimator for D. This is a very slight modification of the
concept of $(\epsilon, M)$-distortion from [18].

Suppose $g > L(e) > f$, $\forall e \in E(T)$, with $f, g > 0$ fixed. Let $f > \epsilon = -\log(1 - 2z)/2$ and
$M = -\log(1 - 2y)/2$. Let A be an algorithm which attempts to recover the full topology
of T by evaluating $K = O(n^2)$ empirical distances between pairs of random variables
u, v. Suppose in addition that A recovers the correct topology if all the empirical
distances inspected are $(M, \epsilon/2)$-approximations of the true distances. Theorem 2.2 and
Corollary 2.3 guarantee that the empirical distance matrix $\hat{D}$ satisfies this property,
with high probability, provided the number of samples N is large enough. If we want
to ensure that $P[A(\hat{D}) \neq T] < 1 - p$ for some $p > 0$, plugging into the above inequality
yields

$$N \geq O(e^{4M} \log n).$$
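
The dependence on $e^{4M}$ can be made explicit with a short calculation. This is a sketch under two assumptions: that the tail bound of Theorem 2.2 has the form $1.5\exp(-(1 - \sqrt{1 - 2z})^2 (1 - 2y)^2 N / 8)$, and that we allow a total failure budget $\delta$ (a symbol introduced here only for illustration). With $\sqrt{1 - 2z} = e^{-\epsilon}$ and $(1 - 2y)^2 = e^{-4M}$, a union bound over K distances and two tails per distance gives:

```latex
% \delta is a hypothetical overall failure budget, not a symbol from the paper.
\begin{align*}
P[\text{some inspected distance fails}]
  &\le 2K \cdot 1.5 \exp\!\Big(-\frac{(1 - e^{-\epsilon})^2 e^{-4M} N}{8}\Big), \\
3K \exp\!\Big(-\frac{(1 - e^{-\epsilon})^2 e^{-4M} N}{8}\Big) \le \delta
  \quad&\Longleftrightarrow\quad
  N \ge \frac{8\, e^{4M}}{(1 - e^{-\epsilon})^2} \log\frac{3K}{\delta}.
\end{align*}
```

For fixed $\epsilon$ and $\delta$, and $K = O(n^2)$, the right-hand side grows like $e^{4M} \log n$.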

This inequality is essential to understanding the need for learning ancestral sequences. 
Indeed, given that the topological depth of an internal node can grow as high as O(logn), 
any method which is restricted to inspecting pairwise distances between leaves of T will 
have to estimate distances as high as $M = O(g \log n)$, which yields $N = n^{O(g)}$. By
contrast, learning ancestral sequences gives us a way to resolve the entire topology by
only inspecting $K = O(n^2)$ distances between nodes separated by at most a constant
distance. If the edge lengths are under the phase transition, as explained in the following
section, we can guarantee that the additional noise coming from estimating internal
sequences is also bounded by a fixed amount. Thus $M = O(1)$ and therefore $N = O(\log n)$.

3. Background on learning ancestral characters 

In this section we will show how to recover the sequences at the interior nodes of a 
phylogenetic tree from the sequences at the leaves of the tree, up to an a priori bounded 
error. We do this by means of a recursive majority algorithm. All the results in this 
section have appeared in previous publications, such as [17], and are used in an identical 
manner in [6]. For this reason we will state them without proof. 

Definition 3.1. Given a sequence of ±1 bits $x_1, \ldots, x_n$, we define the majority function

$$Maj(x_1, \ldots, x_n) = \mathrm{sign}(x_1 + \ldots + x_n + 0.5\omega),$$

where $\omega$ is an unbiased ±1 random variable that is independent of the $x_i$'s.
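
A direct transcription of this definition is sketched below; the function name and the explicit random-generator argument are choices made for this illustration. Since the sum of ±1 bits is an integer and the tie-breaker contributes ±0.5, the total is never zero, so the sign is always well defined.

```python
import random


def maj(bits, rng=random):
    """Majority of +/-1 bits, with ties broken by an independent fair +/-1 coin."""
    omega = rng.choice((-1, 1))
    total = sum(bits) + 0.5 * omega
    return 1 if total > 0 else -1


clear_plus = maj([1, 1, -1])
clear_minus = maj([-1, -1, 1])
tie = maj([1, -1])  # decided entirely by the coin
```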

Definition 3.2 (Definition 4.1 in [17]). Let T = (V, E) be a tree rooted at $\rho$ with leaf-set
$\delta T$. For functions $l : E \to [0, \infty]$ and $\eta : \delta T \to [0, \infty]$, let $CFN(l, \eta)$ be the CFN model
on T where

• the edge length L(e) is equal to l(e) for all $e \in E$ not adjacent to $\delta T$

• $L(e) = l(e) + \eta(v)$ for all edges e = (u, v) with $v \in \delta T$.

Let $Maj(l, \eta) = D(\chi(\rho), Maj(\chi(\delta T)))$, where $\chi$ are the character values on T given by
$CFN(l, \eta)$.
The distance D is the one provided by the joint probability distribution defined by
$CFN(l, \eta)$, together with the independent coin-flips necessary for breaking ties in the
Maj function. In other words, $Maj(l, \eta) = -\log(1 - 2P[\chi(\rho) \neq Maj(\chi(\delta T))])/2$.

The intuition behind the above definition is as follows. Let $\chi$ be the character values at
nodes of T, defined according to the CFN model given by edge lengths l on T. Suppose
that the character value $\chi(u)$ at each leaf $u \in \delta(T)$ is perturbed by an independent noise
source such that the probability of perturbation is $(1 - \exp(-2\eta(u)))/2$. Let $\hat{\chi}(u)$ be
the perturbed character value, so formally:

$$\hat{\chi}(u) \perp \chi(v) \mid \chi(u), \ \forall v \in V(T) \quad \text{and} \quad \hat{\chi}(u) \perp \hat{\chi}(v) \mid \chi(u), \ \forall v \in \delta(T) \qquad (1)$$
$$P[\hat{\chi}(u) \neq \chi(u)] = (1 - \exp(-2\eta(u)))/2 \iff D(\chi(u), \hat{\chi}(u)) = \eta(u). \qquad (2)$$

It is then an easy exercise to verify that our definition of the character values under
$CFN(l, \eta)$ is equivalent to the one below:

$$\chi(u) \text{ for } u \notin \delta(T), \qquad \hat{\chi}(u) \text{ for } u \in \delta(T).$$

For our purposes, the noise at the leaves of the subtree arises from the reconstruction 
of the character values by way of recursive majority. Our hope is that we can design a 
recursive learning procedure such that the probability of error $P[\hat{\chi}(u) \neq \chi(u)]$ remains
bounded away from 0.5 as we progress deeper and deeper into T. Theorem 3.3 achieves
this remarkable feat. Our formulation of the theorem is a specialization of Theorem 4.1
in [17] to binary trees and we state it without proof.



Theorem 3.3 (Theorem 4.1 in [17]). Let $a(\cdot)$ be defined as in [17].
For $d \in \mathbb{Z}_{>0}$, $\lambda_{max} > 0$ and $0 < \alpha < a(2^d) e^{-2d\lambda_{max}}$, there exists $\beta(\lambda_{max}) > 0$,
such that the following hold. Let T be a d-level balanced binary tree and consider the
$CFN(l, \eta)$ model on T, where $\max l < \lambda_{max}$ and $\max \eta < \eta_{max}$. Then

$$Maj(l, \eta) < \max\{\eta_{max} - \log(\alpha)/2, \ \beta\}. \qquad (3)$$

Using Stirling's approximation formula, it can be shown that $a(q) \approx \sqrt{2q/\pi}$. If
$\lambda_{max} = \lambda_0 - \epsilon$ with $\lambda_0 = \log(2)/4$ and $\epsilon > 0$ (i.e. under the phase transition), we have
$2e^{-4\lambda_{max}} > 1$, thus for d large enough $a(2^d) e^{-2d\lambda_{max}} > 1$. Setting $\alpha = 1$ and $\eta_{max} = \beta$ in Theorem 3.3
we obtain the following corollary:

Corollary 3.4. For $0 < \lambda_{max} < \lambda_0$, there exists $d_0 > 0$ such that: for any $d > d_0$,
there exists $\beta(\lambda_{max}, d) < \infty$, such that for any balanced d-level binary tree T and any
functions $l : E(T) \to [0, \lambda_{max}]$, $\eta : \delta T \to [0, \beta]$, we have $Maj(l, \eta) < \beta$.
To put Corollary 3.4 in words, for trees with edge lengths less than $\lambda_{max}$, learning
ancestral character sequences via recursive majority on sub-trees of height d, as detailed
below, gives learned character sequences whose distance to the true sequences is recursively
bounded by $\beta$. This is the crucial result for the development of our algorithm, as
it implies that recursive reconstruction with reliable, non-decaying accuracy is possible
on trees of any size.

Indeed, let us suppose that $l(e) < \lambda_{max} = \lambda_0 - \epsilon$ for all $e \in E(T)$. Let $d > d_0$ and
$\beta$ be as in the above corollary. We decompose T recursively into a collection of edge
disjoint rooted trees in the following manner: start from the root $\rho$ and follow all paths
down the tree until each path reaches length d or terminates with a leaf. Cut the tree at
the endpoint of each path and recurse on the subtrees rooted at these endpoints. This
procedure divides T into trees of depth at most d. Let $T_1, T_2, \ldots$ be the collection of trees
in the subdivision of T and let $\rho_i$ be the root of $T_i$ for all i. See Figure 1(a).




Figure 1. Learning ancestral sequences by bottom-up recursion. 



We can now define a recursive learning process. The learned character value $\hat{\chi}(v)$ is
set equal to $\chi(v)$ for all $v \in \delta T$. For each subtree $T_i$ such that the value $\hat{\chi}(v)$ has been
specified for all $v \in \delta T_i$, we define $\hat{\chi}(\rho_i) = Maj(\hat{\chi}(\delta T_i))$. Now recurse as in Figure 1(b).

Note that some of the subtrees $T_i$ may not be fully balanced as required by Corollary
3.4. Suppose $u \in \delta T_i$ and the topological distance between u and $\rho_i$ is $k < d$. In this
case we replace u by a balanced binary tree of height $d - k$ with all edges of length 0,
which is equivalent to giving $\hat{\chi}(u)$ weight $2^{d-k}$ in the $Maj(\hat{\chi}(\delta T_i))$ vote. For clarity of
exposition, we will keep the notation Maj to represent this weighted majority.
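
The recursive learning process, including the weighting of leaves that sit above depth d, can be sketched for a single site as follows. This is a hedged illustration only: the tree encoding (a `children` dictionary and a `leaf_value` dictionary) and the function name are hypothetical, and d is assumed to be at least 1.

```python
import random


def learn(node, children, leaf_value, d, rng):
    """Learned +/-1 value at `node` via recursive majority over depth-d subtrees."""
    if not children.get(node):
        # A leaf of T: the learned value is the observed value.
        return leaf_value[node]
    votes = 0
    stack = [(node, 0)]
    while stack:
        u, k = stack.pop()
        kids = children.get(u, [])
        if u != node and (k == d or not kids):
            # u lies on the boundary of the depth-d subtree rooted at `node`.
            # A leaf reached at depth k < d votes with weight 2^(d - k);
            # nodes at depth exactly d vote with weight 1.
            votes += (2 ** (d - k)) * learn(u, children, leaf_value, d, rng)
        else:
            stack.extend((c, k + 1) for c in kids)
    # Tie-breaking coin, as in the definition of Maj.
    return 1 if votes + 0.5 * rng.choice((-1, 1)) > 0 else -1


# Unbalanced example with d = 2: leaf "b" sits at depth 1, so it votes with weight 2.
children = {"r": ["a", "b"], "a": ["l1", "l2"]}
rng = random.Random(0)
all_plus = learn("r", children, {"l1": 1, "l2": 1, "b": 1}, 2, rng)
outvoted = learn("r", children, {"l1": 1, "l2": -1, "b": -1}, 2, rng)
```

In the second call the votes are +1, -1 and 2·(-1), so the learned root value is -1 regardless of the coin.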

Theorem 3.5. Suppose $\max\{l(e) : e \in E(T)\} = \lambda_{max} < \lambda_0 - \epsilon$ with $\epsilon > 0$, and let d
and $\beta$ be as in Corollary 3.4. The procedure described above gives a bottom up learning
process which ensures that $D(\chi(\rho), \hat{\chi}(\rho)) < \beta$.

Proof of Theorem 3.5. Set $\eta(u) = D(\chi(u), \hat{\chi}(u))$ for all $u \in \delta T$ or $u = \rho_i$ for some
i. We prove the following two conditions by bottom-up induction on the sub-trees $T_i$:

• $\hat{\chi}(u) \perp \chi(v) \mid \chi(u)$, $\forall v \in V(T_i)$, $u \in \delta T_i$ and $\hat{\chi}(u) \perp \hat{\chi}(v) \mid \chi(u)$, $\forall v \in \delta(T)$

• $\eta(\rho_i) < \beta$.
First, for all $u \in \delta T$, $\hat{\chi}(u) = \chi(u)$, so $\eta(u) = 0$. Both hypotheses are thus obeyed trivially
for subtrees formed by a single leaf. This provides the base case for our induction.

Now consider a subtree $T_i$ and suppose $\eta(u) < \beta$ for all $u \in \delta T_i$. If $u \in \delta T$, the first
induction hypothesis is obeyed trivially, as $\hat{\chi}(u) = \chi(u)$. Alternatively, suppose $u = \rho_j$.
Let $T'$ be the subtree of T rooted at u. Then $\hat{\chi}(u)$ is a function of the values $\chi(\delta T')$ and
moreover the Markov property of the CFN model implies that $\chi(\delta T') \perp \chi(v) \mid \chi(u)$, for
all $v \in V(T_i)$. Therefore $\hat{\chi}(u) \perp \chi(v) \mid \chi(u)$. The other statement of the first induction
hypothesis follows similarly.

Finally, Corollary 3.4 implies

$$\eta(\rho_i) = D(\chi(\rho_i), \hat{\chi}(\rho_i)) = D(\chi(\rho_i), Maj(\hat{\chi}(\delta T_i))) = Maj(l|_{E(T_i)}, \eta|_{\delta T_i}) < \beta,$$

so the second induction hypothesis is also obeyed. □



4. General outline of the algorithm TREE-MERGE 

Let $\lambda_0$ be the phase transition. Suppose the set of taxa X has cardinality n and
the character sequences identifying the taxa have length N. Given $\epsilon > 0$ we define the
following quantities:

• $\lambda_{max}(\epsilon) = \lambda_0 - \epsilon$

• $d(\lambda_{max}(\epsilon))$ and $\beta(\lambda_{max}(\epsilon))$ are the depth of the trees in the recursive majority
decomposition and the upper bound on the learning noise, as in Corollary 3.4

• $M(\epsilon) = 24\lambda_0 + 6\beta(\lambda_{max}(\epsilon)) + 12\epsilon$.

Given the length N of available sequences, the number of taxa n, and a user-defined
maximum allowed probability of error $\xi$, we can pick $\epsilon$ such that

$$1.5 \exp\Big(-\frac{(1 - e^{-\epsilon})^2 e^{-4M(\epsilon)} N}{8}\Big) < \frac{\xi}{16n^2}. \qquad (4)$$


By Theorem 2.2 and Corollary 2.3, for any two character values u, v, learned or observed,
drawn from the joint probability distribution $P^*_{T,L}$, the empirical distance $\hat{D}(u, v)$ will
be an $(M, \epsilon/2)$-approximation for the true distance D(u, v), with probability at least
$1 - n^{-2}\xi/8$. When an event happens with probability at least $1 - O(n^{-2}\xi)$, we say that
it occurs with high probability. By Lemma 4.2, TREE-MERGE will evaluate no more
than $8n^2$ empirical distances. By the union bound, with probability at least $1 - \xi$ the
following condition holds:

Condition 4.1 ($\star$). All the empirical distances evaluated by our algorithm are $(M, \epsilon/2)$-approximations
of the corresponding true distances.
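
The trade-off between N, n, the error budget and the tolerance can be explored numerically. The sketch below rests on assumptions that should be stated loudly: it takes the per-tail bound to be 1.5·exp(-(1 - e^{-ε})² e^{-4M} N / 8), it takes the per-tail target to be ξ/(16n²) (so that two tails over 8n² distances sum to ξ), and it treats M as a fixed input rather than as a function of ε, since β is not computable from the text. All function names are hypothetical.

```python
import math


def tail_rate(eps, M):
    """Exponent rate (1 - e^{-eps})^2 * e^{-4M} / 8 of the assumed tail bound."""
    return (1.0 - math.exp(-eps)) ** 2 * math.exp(-4.0 * M) / 8.0


def tail_bound(eps, M, N):
    """Assumed one-sided failure bound for a single empirical distance."""
    return 1.5 * math.exp(-tail_rate(eps, M) * N)


def inequality_holds(eps, M, N, n, xi):
    """Check the assumed per-tail target xi / (16 n^2)."""
    return tail_bound(eps, M, N) < xi / (16.0 * n * n)


def min_sequence_length(eps, M, n, xi):
    """Threshold N0 such that the inequality holds exactly when N > N0."""
    return math.log(24.0 * n * n / xi) / tail_rate(eps, M)


# Illustrative values only; they are not taken from the paper.
N0 = min_sequence_length(0.05, 1.0, 100, 0.1)
```

Because the rate shrinks like e^{-4M}, the threshold `N0` grows quickly with M, which is the quantitative reason for keeping all inspected distances bounded.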

Lemma 4.2. Algorithm TREE-MERGE reconstructs at most 3n ancestral sequences
in addition to the n sequences at the leaves, and thus computes at most $8n^2$ pairwise
distances.

Corollary 4.3. By the union bound applied to equation (4), with probability $1 - \xi$ all
the empirical distances observed by TREE-MERGE are $(M, \epsilon/2)$-approximations of the
corresponding true distances. Thus condition ($\star$) holds with probability $1 - \xi$.
Proof of Lemma 4.2. Indeed, there are $n - 2$ internal nodes in any binary tree
on n leaves. Any internal vertex of any subtree is also a node in the parent tree. Our 
algorithm progresses by joining at each step a pair of components of the forest through 
the addition of a new edge, creating zero, one or two more nodes of the forest. No nodes 
are ever destroyed. 

We define a clade of a tree T' to denote a subtree of T' that is induced by removing 
an edge $e \in E(T')$. Each edge e defines two clades and for each clade there is a natural
rooting at the corresponding endpoint of e. For each internal node v of a forest, there
are three clades rooted at v, each one induced by one of the edges adjacent to v. These 
clades correspond to the three "directions" leading away from v. 

Inspection of the algorithm TREE-MERGE shows that an internal sequence corresponding
to a node/direction pair is learned by TREE-MERGE when the corresponding
clade becomes "proper" (see Section 5 for definition). Once a sequence is learned, it
gets stored and is never modified, regardless of new growth in the corresponding clade.
Each internal node of the tree will have exactly three learned sequences, each being
constructed exactly once. Thus TREE-MERGE inspects at most $n + 3(n - 2)$ sequences
and at most $8n^2$ sequence pairs. □

We will prove that under condition ($\star$) TREE-MERGE will recover a topologically
correct forest of edge-disjoint subtrees of the model tree T. If, in addition, the conditions
of Theorem 4.5 hold, then TREE-MERGE will recover the entire tree. In the subsequent
treatment we will generally assume that condition ($\star$) holds, unless otherwise stated.

The algorithm TREE-MERGE progresses, as the name suggests, by gradually build- 
ing a sub-forest F of T, such that the following three invariants are obeyed: 

I1: For any component $T_i \in F$ and any edge $e \in E(T_i)$, the path corresponding to e
in T has length at least $2\epsilon$.
I2: For any component $T_i \in F$, all edges of $T_i$ except at most one have corresponding
paths in T of length at most $\lambda_0 - \epsilon$, and all edges have corresponding paths of
length at most $2\lambda_0 - 4\epsilon$.
I3: Any two connected components $T_i, T_j \in F$ are edge disjoint as subgraphs of T.

Invariant I1 is needed in order to ensure that reconstructed ancestral divergence events
are long enough to be reliable. I2 guarantees that ancestral sequences can be learned
reliably from subtrees with edges under the phase transition. Finally I3 is a technical
requirement of the algorithm. It allows us to reliably resolve topological information
despite conditional dependencies between learned ancestral sequences.

In order to ensure that I1 and I2 hold however, we need a reliable way to estimate
edge lengths. As observed in Section 2, ($\star$) guarantees that edge lengths which can be
estimated as middle paths of quartets with diameter less than M will have an estimation
error less than $\epsilon$. In Lemma 6.2 we show formally that TREE-MERGE will in fact
estimate all edge lengths within $\epsilon$ error.

Given this fact, we can now enforce I1 and I2 by requiring instead that the following
two conditions hold:
Algorithm F = TREE-MERGE(X, $\chi$)

INPUT: n binary sequences $\chi$ of length N corresponding to taxa X.

OUTPUT: An unrooted forest F detailing partial information on the evolutionary
relationships of the taxa X.

(1) set F = X, i.e. F contains trees formed by single nodes.

(2) insert all leaf taxa distances in NodeDistList.

(3) insert all tree distances less than $M/3 - \epsilon$ in TreeDistQueue.

(4) while |F| > 1

    (a) if TreeDistQueue = ∅, return F

    (b) $(T_1, T_2)$ = pop(TreeDistQueue). Let $(E_1, E_2)$ = TreeConnection($T_1$, $T_2$)

    (c) if $|E_1| > 1$ or $|E_2| > 1$, continue

    (d) Let Q = (a, b : c, d), where $E_1 = \{(a, b)\}$, $E_2 = \{(c, d)\}$. Set $T_{new} = T_1 \cup T_2 \cup Q$
        and compute the edge lengths of the quartet Q.

    (e) if $T_{new}$ violates condition C1, continue

    (f) if $T_{new}$ violates condition C2, continue

    (g) if $\exists T_k$ s.t. TreeDistance($T_1$, $T_2$) + $3\epsilon$ > TreeDistance($T_1$, $T_k$) +
        TreeDistance($T_k$, $T_2$), continue

    (h) else

        (i) $F = F \setminus \{T_1, T_2\} \cup \{T_{new}\}$.

        (ii) compute learned characters for all new roots of proper clades of $T_{new}$

        (iii) insert all distances involving new learned characters in NodeDistList.

        (iv) UpdateTreeDistQueue($T_1$, $T_2$, $T_{new}$)

(5) return F

Figure 2. Algorithm TREE-MERGE.
C1: Each edge in F has estimated distance at least $3\epsilon$.

C2: For each $T_i \in F$, the edges of $T_i$ have estimated length at most $\lambda_0 - 2\epsilon$, with the
exception of at most one edge, whose estimated length is less than $2\lambda_0 - 5\epsilon$. We
call such an edge a long edge.

Philosophically, our approach is very similar to the classical minimum spanning tree
algorithms. At each step of the algorithm we will join two connected components $T_i, T_j$
such that the new component does not violate C1 and C2, and the estimated length of
the path linking $T_i$ and $T_j$ is the shortest among all candidate pairs. This by itself does
not guarantee I3. However, condition ($\star$) and step 4.(g) of TREE-MERGE achieve this
purpose, as will be shown formally in Section 5.
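
To make the minimum spanning tree analogy concrete, here is a deliberately simplified Kruskal-style sketch. It is not TREE-MERGE: it only illustrates the shortest-first merge order with a distance cutoff (playing the role of the queue threshold), and it omits quartet resolution, the edge-length conditions, the triangle-style check of step 4.(g), and all ancestral sequence learning. The function name and input encoding are hypothetical.

```python
def shortest_first_merge(nodes, dist, cutoff):
    """Join components in order of increasing distance, ignoring pairs at or above cutoff."""
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    accepted = []
    candidates = sorted((d, a, b) for (a, b), d in dist.items() if d < cutoff)
    for d, a, b in candidates:
        ra, rb = find(a), find(b)
        if ra != rb:  # only join distinct components
            parent[ra] = rb
            accepted.append((a, b))
    components = len({find(v) for v in nodes})
    return accepted, components


nodes = ["a", "b", "c", "d"]
dist = {
    ("a", "b"): 0.1, ("c", "d"): 0.2, ("a", "c"): 0.5,
    ("a", "d"): 0.9, ("b", "c"): 0.9, ("b", "d"): 0.9,
}
full = shortest_first_merge(nodes, dist, 0.6)
partial = shortest_first_merge(nodes, dist, 0.3)
```

With the smaller cutoff the procedure stops with two components, which mirrors the way a forest rather than a full tree is returned when no further merge is considered reliable.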

We can now state the three main results of this paper. We postpone the formal proofs
until Section 6. All of our results assume that N, $\xi$ and $\epsilon$ are such that (4) holds, which
in turn guarantees that condition ($\star$) holds with probability at least $1 - \xi$.

Theorem 4.4. If ($\star$) holds, algorithm TREE-MERGE returns a topologically correct
sub-forest F of T satisfying invariants I1, I2, I3.
Algorithm UpdateTreeDistQueue($T_1$, $T_2$, Q)

INPUT: Subtrees $T_1$ and $T_2$ to be joined and old TreeDistQueue data structure containing
distances between "communicating" pairs of trees in F.
OUTPUT: Updated TreeDistQueue data structure.

for each $T_k \in F$ with $k \neq 1, 2$

(1) Let $E_1 = \{(a, b)\}$ and $E_2 = \{(c, d)\}$, where Q = (a, b : c, d).

(2) if $(T_1, T_k)$ or $(T_2, T_k)$ were ever in TreeDistQueue

    • For i = 1, 2, let $(E'_i, E_k)$ = TreeConnection($T_i$, $T_k$) if $(T_i, T_k) \in$
      TreeDistQueue, or $E'_i = E(T_i)$ otherwise.

    • If $E_k$ has not been set, let $E_k = E(T_k)$.

    • if $E'_1 \neq E_1$ set $E_{new} = E'_1$.

    • elseif $E'_2 \neq E_2$ set $E_{new} = E'_2$.

    • else set $E_{new} = E(Q)$.

    • $(E_{new}, E_k)$ = TreeConnection($T_{new}$, $T_k$, $E_{new}$, $E_k$).

(3) elseif $\hat{D}(u, t) < M/3 - \epsilon$ or $\hat{D}(v, t) < M/3 - \epsilon$ for some $t \in V(T_k)$, then

    • $(E_{new}, E_k)$ = TreeConnection($T_{new}$, $T_k$, $E(T_{new})$, $E(T_k)$).

(4) if a connection was found above

    • d = TreeDistance($T_{new}$, $T_k$, $E_{new}$, $E_k$).

    • TreeDistQueue = remove(TreeDistQueue, $(T_1, T_k)$, $(T_2, T_k)$),
      TreeDistQueue = insert(TreeDistQueue, $(T_{new}, T_k)$).

end

Figure 3. Subroutine UpdateTreeDistQueue updates the connections
between trees in the forest F, after two components are merged.

Theorem 4.5. Let $T$ satisfy $6\epsilon < L(e) < \lambda_0 - 3\epsilon$, $\forall e \in E(T)$. Then given $N$ independent samples $x_1, \ldots, x_N$ from the character distribution $P_{T,L}$, $T$ will be fully and correctly recovered by TREE-MERGE with probability at least $1 - \xi$.

Theorem 4.6. TREE-MERGE always terminates in $O(Nn^2 + n^3)$ time, where the proportionality constant is a decreasing function of $\xi$ and $\epsilon$.

Theorem 4.5 and equation (1) provide us with specific edge-length bounds for trees that can be reconstructed with failure probability at most $\xi$ from sequences of length $N$. Indeed, for any $N$ and $\xi$, there is a lower bound $\epsilon_{N,\xi}$ such that any $\epsilon > \epsilon_{N,\xi}$ satisfies inequality (1), which ensures that ($\star$) will hold.

This is an important feature of TREE-MERGE, as it allows us to recover an edge- 
length reliability interval from the available sequence lengths. In contrast, previous 
research has focused on recovering the necessary sequence lengths for full reconstruction, 
assuming that lower and upper bounds on edge lengths were known. As these bounds cannot be known a priori, this is hardly useful, especially for algorithms which do not provide partial information in case full reconstruction is not feasible. We note, however, that under this paradigm our methods still achieve the asymptotically best known sequence length requirements.

Corollary 4.7. For trees $T$ satisfying $6\epsilon < L(e) < \lambda_0 - 3\epsilon$ for some fixed $\epsilon$, equation (1) and Theorem 4.5 show that $N = O_{\xi,\epsilon}(\log(n))$ sites suffice for TREE-MERGE to reconstruct $T$ with probability at least $1 - \xi$.

We finally note that the constant factor in our running time bounds depends on the desired maximum probability of failure $\xi$, and on $\epsilon$. If one is to consider these parameters as fixed, then the running time is indeed simply $O(n^3 + n^2 N)$. A higher $\epsilon$ value implies a faster running time and shorter sequence length requirements, but trades off against the size of the reconstructed forest. Similarly, a higher value for $\xi$ implies shorter sequence length requirements, but trades off against the accuracy of the reconstructed forest.
The specific dependencies between these parameters however are very complicated and 
beyond the scope of this paper. 

5. A conditional independence toolkit

In this section we present four lemmas which are the main workhorses of our algorithms. All the results presented here hold in general for trees with arbitrary edge lengths, as they are qualitative statements which do not depend on the accuracy of the learned character values. This section, together with the proof of Lemma 5.4 in Appendix A, provides a stand-alone toolkit of useful new results in this area.

To simplify notation, here and in the remainder of the paper, for a node $v \in V(T)$ we will use $v$ to also denote the character value $x(v)$, as the distinction will be clear from context. Let an induced subtree $T'$ of $T$ be a subtree such that $\delta(T') \subset \delta(T)$. For an induced subtree $T'$ rooted at $\rho$, we will denote by $\tilde{\rho}(T')$ the character value at $\rho$ that is "learned" from $x(\delta(T'))$ by recursive majority on $T'$, as described in the previous section. We also denote by $V(T) \cap T'$ the vertices of $T$ that are either in $V(T')$ or lie on the paths of $T$ corresponding to the edges of $T'$. Finally, for two nodes $u, v \in V(T)$ we let $P(u, v)$ be the path connecting $u$ and $v$ in $T$.
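
As a rough illustration of what "learning" a root value by recursive majority means, the sketch below takes a simplified reading in which every internal node receives the majority of its children's learned values, with a fair coin as tie-breaker. The function name and the dictionary-based tree encoding are assumptions made for this example, and the exact recursion used in the paper may differ.

```python
import random

def recursive_majority(tree, node, leaf_values, rng=random):
    """Learn a +/-1 value at `node` from the leaves below it.

    tree: dict mapping each internal node to the list of its children.
    leaf_values: dict mapping each leaf to its observed value in {+1, -1}.
    """
    children = tree.get(node)
    if not children:
        return leaf_values[node]  # a leaf: use the observed character
    total = sum(recursive_majority(tree, c, leaf_values, rng) for c in children)
    if total == 0:
        return rng.choice((1, -1))  # independent fair coin as tie-breaker
    return 1 if total > 0 else -1
```

Apart from the coin flips, the learned value is a coordinate-wise increasing function of the leaf values, which is the property used in the proof of Lemma 5.4.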

We let $D$ denote the distance between uniform Bernoulli random variables defined in the Introduction, where the underlying joint probability distribution is the one given by the CFN model on $T$, $P_{T,L}$, and the random coin tosses involved in the recursive majority learning of ancestral characters.
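
For concreteness, an empirical version of such a distance can be sketched as follows. The formula $D(u, v) = -\frac{1}{2}\log E[uv]$ for $\pm 1$ characters is an assumption here (the definition is given in the Introduction, outside this section), chosen because it agrees with the edge lengths $L(e) = -\log(1 - 2p(e))/2$ used in Appendix A. The function name is hypothetical.

```python
import math

def empirical_distance(seq_u, seq_v):
    """Estimate D(u, v) = -1/2 * log E[u v] from aligned +/-1 sequences."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be non-empty and of equal length")
    # Empirical correlation between the two character sequences.
    corr = sum(a * b for a, b in zip(seq_u, seq_v)) / len(seq_u)
    if corr <= 0:
        return math.inf  # correlation too weak for a finite estimate
    return -0.5 * math.log(corr)
```

The same routine applies unchanged when one or both sequences are learned rather than observed, which is how distances involving internal nodes would be estimated.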

The following two lemmas are present and used almost identically in [6]: 

Lemma 5.1. Let $T_1, T_2$ be edge-disjoint subtrees of $T$ rooted at $\rho_1$ and $\rho_2$, such that $\delta(T_1), \delta(T_2) \subset \delta(T)$. Let $v_1 \in V(T) \cap T_1$ and $v_2 \in V(T) \cap T_2$ be the endpoints of the path $P(v_1, v_2)$ joining $T_1$ to $T_2$ along the edges of $T$. Then

$$(\tilde{\rho}_1(T_1) \perp v \mid x) \quad \text{and} \quad (\tilde{\rho}_1(T_1) \perp \tilde{\rho}_2(T_2) \mid x),$$

for any $v \in V(T) \cap T_2$ and $x \in P(v_1, v_2)$. See Figure 4(a).

Proof of Lemma 5.1: The nodes of $T_1$ are separated from those of $T_2$ by $x$. By the Markov property of the CFN model, $(\delta(T_1) \perp v \mid x)$. Since $\tilde{\rho}_1(T_1)$ is a deterministic function of $\delta(T_1)$ and independent coin flips (tie-breakers in the recursive majority), we conclude $(\tilde{\rho}_1(T_1) \perp v \mid x)$. The proof of the second statement is almost identical and is omitted. □

Figure 4. Illustration of conditional independence statements in (a) Lemma 5.1 and (b) Lemma 5.2.

Lemma 5.2 gives us a way to reliably estimate lengths of internal paths of $T$, and the subsequent easy corollary shows that if errors in the empirical distances are less than $\epsilon/2$, then the path length estimates are correct to within $\epsilon$.

Lemma 5.2. Let $a_0, b_0, c_0, d_0 \in V(T)$ inducing topology $Q = (a_0, b_0 \mid c_0, d_0)$ on $T$, and let $l$ be the length of the middle path of $Q$. Let $T_a, T_b, T_c, T_d$ be induced subtrees of $T$ rooted at $a, b, c, d$ respectively, containing $a_0, b_0, c_0$ and $d_0$ respectively, such that $Q, T_a, T_b, T_c, T_d$ are pairwise edge disjoint. Then

$$FPM(D; \tilde{a}(T_a), \tilde{b}(T_b), \tilde{c}(T_c), \tilde{d}(T_d)) = Q,$$
$$ME(D; \tilde{a}(T_a), \tilde{b}(T_b) \mid \tilde{c}(T_c), \tilde{d}(T_d)) = l.$$

Corollary 5.3. If $|\hat{D}(x, y) - D(x, y)| < \epsilon/2$, $\forall x, y \in \{\tilde{a}(T_a), \tilde{b}(T_b), \tilde{c}(T_c), \tilde{d}(T_d)\}$ and $l > \epsilon$, then

$$FPM(\hat{D}; \tilde{a}(T_a), \tilde{b}(T_b), \tilde{c}(T_c), \tilde{d}(T_d)) = Q,$$
$$|ME(\hat{D}; \tilde{a}(T_a), \tilde{b}(T_b) \mid \tilde{c}(T_c), \tilde{d}(T_d)) - l| < \epsilon.$$

Proof of Lemma 5.2: By repeated application of Lemma 5.1 we obtain

$$D(\tilde{x}(T_x), \tilde{y}(T_y)) = D(x_0, y_0) + D(x_0, \tilde{x}(T_x)) + D(y_0, \tilde{y}(T_y)),$$

for all $x, y \in \{a, b, c, d\}$. Plugging the above equality into the definition of FPM and ME yields the desired result. □
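
A small numerical sketch may help in reading Lemma 5.2 and Corollary 5.3. It assumes the standard definitions, which are given earlier in the paper and not repeated in this section: FPM returns the split minimising the sum of within-pair distances, and ME of a split $(p, q \mid r, s)$ is $\frac{1}{2}(D(p,r) + D(q,s) - D(p,q) - D(r,s))$. The function and variable names are hypothetical.

```python
def fpm_and_me(dist, a, b, c, d):
    """Four-point method on four labels: returns (split, middle-edge estimate).

    dist: function (x, y) -> distance, assumed symmetric.
    """
    splits = {
        ((a, b), (c, d)): dist(a, b) + dist(c, d),
        ((a, c), (b, d)): dist(a, c) + dist(b, d),
        ((a, d), (b, c)): dist(a, d) + dist(b, c),
    }
    best = min(splits, key=splits.get)  # smallest within-pair sum
    (p, q), (r, s) = best
    middle = 0.5 * (dist(p, r) + dist(q, s) - dist(p, q) - dist(r, s))
    return best, middle

# Additive example: pendant edges a=1, b=2, c=3, d=4, middle edge l=0.5.
EXACT = {
    frozenset("ab"): 3.0, frozenset("cd"): 7.0,
    frozenset("ac"): 4.5, frozenset("bd"): 6.5,
    frozenset("ad"): 5.5, frozenset("bc"): 5.5,
}

def exact(x, y):
    return EXACT[frozenset((x, y))]

def noisy(x, y):
    # Adversarial perturbation of size 0.19 < eps/2 with eps = 0.4.
    shift = 0.19 if frozenset((x, y)) in (frozenset("ab"), frozenset("cd")) else -0.19
    return exact(x, y) + shift
```

With exact distances the split $(a, b \mid c, d)$ is recovered with middle length 0.5. With the perturbed distances the same split is still selected, and the middle-edge estimate stays within $\epsilon = 0.4$ of the true value, as Corollary 5.3 predicts for $l > \epsilon$.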

The next lemma provides a restriction of the triangle inequality for characters and learned characters under the CFN model. As mentioned in the Introduction, the main difficulty with using learned character sequences at internal nodes is that these character sequences depend non-trivially on the leaves of $T$. This destroys the conditional independence relations which turn our distance measures into additive metrics and hinders the identification of speciation events from pairwise distance information. Lemma 5.5 shows a case where the conditional dependence relations induced by using learned character sequences will act in our favor, through a version of the triangle inequality given in Lemma 5.4.

Lemma 5.4. Let $T'$ be an induced subtree of $T$ rooted at $\rho$ and let $v \in (V(T) \cap T') \setminus \delta(T')$. Then

$$D(\tilde{\rho}(T'), v) \le D(\rho, v) + D(\rho, \tilde{\rho}(T')). \qquad (5)$$

Proof of Lemma 5.4: See Appendix A. □

Lemma 5.4 provides the foundation for the next result, our main workhorse in the progressive construction of the topology of $T$.

Lemma 5.5. Let $T'$ and $T_d$ be edge disjoint induced subtrees of $T$. Let $o$ be an internal node of $T'$ and let $a, b, c$ be its neighbors in $T'$. Let $T_a, T_b, T_c$ be the clades of $T'$ rooted at $a, b, c$ respectively which do not contain $o$. Suppose that the shortest path from $T_d$ to $o$ does not pass through $b$ or $c$. Then

$$FPM(D; \tilde{a}(T_a), \tilde{b}(T_b), \tilde{c}(T_c), \tilde{d}(T_d)) = (d, a \mid c, b).$$




Figure 5. Properly connecting induced subtrees by inferring quartets on 
learned character values. 



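Proof of Lemma 5.5: Our assumptions imply, by repeated application of Lemma 5.1,

$$D(\tilde{x}, \tilde{y}) = D(\tilde{x}, x) + D(x, y) + D(y, \tilde{y}) \quad \text{for all } x, y \in \{a, b, c\}. \qquad (6)$$

Let $d'$ be the node of $T$ where the path from $T_d$ to $o$ intersects $T_a \cup P(o, a)$. There are two cases: either $d'$ is on the path from $o$ to $a$, or $d' \in T_a \cap V(T)$. See Figure 5.

Case 1: $d'$ is on the path from $o$ to $a$. Lemma 5.1 yields:

$$D(\tilde{a}, \tilde{d}) = D(\tilde{a}, a) + D(a, d') + D(d', \tilde{d})$$
$$D(\tilde{b}, \tilde{d}) = D(\tilde{b}, b) + D(b, o) + D(o, d') + D(d', \tilde{d})$$
$$D(\tilde{c}, \tilde{d}) = D(\tilde{c}, c) + D(c, o) + D(o, d') + D(d', \tilde{d})$$

Combining the above with (6) gives

$$D(\tilde{a}, \tilde{b}) + D(\tilde{c}, \tilde{d}) = D(\tilde{a}, \tilde{c}) + D(\tilde{b}, \tilde{d}) = D(\tilde{b}, \tilde{c}) + D(\tilde{a}, \tilde{d}) + 2D(o, d').$$

Case 2: $d' \in T_a$. Lemma 5.1 yields:

$$D(\tilde{a}, \tilde{d}) = D(d', \tilde{a}) + D(d', \tilde{d})$$
$$D(\tilde{b}, \tilde{d}) = D(\tilde{b}, b) + D(b, o) + D(o, a) + D(a, d') + D(d', \tilde{d})$$
$$D(\tilde{c}, \tilde{d}) = D(\tilde{c}, c) + D(c, o) + D(o, a) + D(a, d') + D(d', \tilde{d})$$

But by Lemma 5.4, $D(d', \tilde{a}) \le D(d', a) + D(a, \tilde{a})$, and therefore

$$D(\tilde{a}, \tilde{b}) + D(\tilde{c}, \tilde{d}) = D(\tilde{a}, \tilde{c}) + D(\tilde{b}, \tilde{d}) \ge D(\tilde{b}, \tilde{c}) + D(\tilde{a}, \tilde{d}) + 2D(o, a).$$

In both cases the statement of the lemma follows by the definition of FPM. □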

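Corollary 5.6. Given the hypotheses of Lemma 5.5, suppose $ME(D; d, a \mid c, b) = l > \epsilon$, and $|\hat{D}(x, y) - D(x, y)| < \epsilon/2$, $\forall x, y \in \{\tilde{a}, \tilde{b}, \tilde{c}, \tilde{d}\}$. Then

$$FPM(\hat{D}; \tilde{a}(T_a), \tilde{b}(T_b), \tilde{c}(T_c), \tilde{d}(T_d)) = (d, a \mid c, b),$$
$$ME(\hat{D}; \tilde{d}(T_d), \tilde{a}(T_a) \mid \tilde{c}(T_c), \tilde{b}(T_b)) \ge l - \epsilon.$$

In essence, the above corollary states that, when the ($\star$) condition is obeyed, FPM estimates topological information correctly.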

6. Implementation details 

This section provides all the implementation details for algorithm TREE-MERGE. We let $M$, $\epsilon$, $\xi$ be as determined in Section 4. For the remainder of the paper we will assume, unless otherwise stated, that condition ($\star$) holds. As mentioned previously, we maintain an edge-disjoint sub-forest $F$ of $T$, such that, with high probability, the invariants I1, I2 and I3 are satisfied. Under ($\star$) we are able to maintain I1 and I2 by enforcing C1 and C2.

These invariants are crucial for our ability to ensure that topological information can be reliably estimated from learned sequences (I1 and I3), and that learning of ancestral sequences can be performed reliably (I2), by learning via recursive majority on "proper" clades, which are guaranteed to have edge lengths under the phase transition:

Definition 6.1. Given a subtree $T' \in F$, a clade $T''$ of $T'$ is called proper if all the edges of $T''$ have estimated lengths shorter than $\lambda_0 - 2\epsilon$.

Let $v$ be the root of clade $T''$. By ($\star$) and Lemma 6.2, any proper clade has true edge lengths less than $\lambda_0 - \epsilon$, which guarantees that the learned character sequence $\tilde{v}(T'')$ is at distance at most $\beta$ from the true sequence at $v$. Given an internal node $v$ of some sub-tree $T' \in F$, there are three clades of $T'$ rooted at $v$. Given any edge $e \in E(T')$, we let $T'(v, e)$ denote the unique clade of $T'$ which is rooted at $v$ and does not contain edge $e$. Letting $e_b$ be the long edge described by condition C2, we see that for each node $v \in V(T')$, $T'(v, e_b)$ will be proper. Thus $D(v, \tilde{v}(T'(v, e_b))) < \beta$: under invariant I2, we can reliably learn the ancestral sequences of all nodes in $F$. Thus at any point in the algorithm, the sequence at any vertex of $F$ can be learned from some proper clade of $F$, rooted at that vertex.

Algorithm $d$ = EdgeLength($e$, $T'$).
INPUT: Edge $e$ of a subtree $T' \in F$.
OUTPUT: Estimated length of $e$.

(1) Let $a, b, c, d \in V(T')$ be the four neighboring nodes of $e$ in $T'$: $e$ is the middle edge of the quartet $Q = (a, b \mid c, d)$.

(2) if some edges of $T'(a, e)$ have not been estimated and/or $T'(a, e)$ is not proper, let $a'$ be the closest descendant of $a$ in $T'(a, e)$ such that $T'(a', e)$ is proper. Set $a = a'$.

(3) repeat the above process for $b, c, d$

(4) if the diameter of $Q = (\tilde{a}(T'(a, e)), \tilde{b}(T'(b, e)) \mid \tilde{c}(T'(c, e)), \tilde{d}(T'(d, e)))$ is higher than $M - \epsilon$ return FAIL

(5) return $d = ME(\hat{D}; Q)$.

Figure 6. Procedure EdgeLength computes the lengths of new edges which may be added to $F$ by joining two of its component trees.
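
Steps (4) and (5) of EdgeLength can be sketched as below. This is only an illustration under two assumptions: the diameter of a quartet is taken to be its largest pairwise estimated distance, and ME is the usual middle-edge formula. The descent to proper clades in steps (2) and (3) is omitted, and the function name and the use of `None` for FAIL are hypothetical.

```python
def edge_length(dist, quartet, M, eps):
    """Middle-edge estimate for the quartet (a, b | c, d), or None on FAIL.

    dist: function (x, y) -> estimated distance between learned sequences.
    """
    a, b, c, d = quartet
    nodes = (a, b, c, d)
    # Largest pairwise distance among the four quartet nodes.
    diameter = max(dist(x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:])
    if diameter > M - eps:
        return None  # step (4): quartet too wide to be estimated reliably
    # Step (5): middle-edge estimate for the split (a, b | c, d).
    return 0.5 * (dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d))
```

The diameter gate is what ties the estimate to condition ($\star$): only quartets whose distances are all short enough to be estimated accurately are ever used.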

Lemma 6.2. Suppose the forest $F$ reconstructed by TREE-MERGE at some intermediate step is topologically correct, contains edge-disjoint trees, its edge lengths have been computed to within $\epsilon$ error, obeys conditions C1 and C2 (and hence obeys I1 and I2), and all distances between pairs of trees appearing in TreeDistQueue have also been estimated to within $\epsilon$ error. Then, under condition ($\star$), the estimated edge lengths computed at step 4(d) of TREE-MERGE are also correct within $\epsilon$.

Proof of Lemma 6.2: We use the notation of the TREE-MERGE pseudocode. Since $T_1$ and $T_2$ are candidates for being joined, the estimated length of the middle path $(u, v)$ of $Q$, which was computed at a previous iteration of TreeConnection, is at most $2\lambda_0 - 5\epsilon$, and is correct up to $\epsilon$ by our hypothesis. Similarly, $(a, b)$ and $(c, d)$ are edges of $F$ and obey I1. Thus all edges of $Q$ are less than $2\lambda_0 - 4\epsilon$.

Since the length of $(u, v)$ has been estimated, we only need to estimate edges $(a, u)$, $(b, u)$, $(c, v)$ and $(d, v)$. It is an easy exercise to prove that the two neighbors of $a$ either root a proper clade in $T_1$ which does not contain $a$, or have a neighbor who roots such a clade. This follows by C2. It follows that the edge $(a, u)$ can be estimated from a quartet of diameter at most $6\lambda_0 + 2\beta$. Thus the procedure EdgeLength will estimate it within $\epsilon$. We proceed by symmetry for the other edges. □

We next give the details of the TreeConnection and TreeDistance subroutines, which find the topologically correct way to connect two sub-trees $T_1, T_2$. TreeConnection requires seed nodes $u_i \in V(T_i)$, $i \in \{1, 2\}$, rooting proper clades $T'_1$ of $T_1$ and $T'_2$ of $T_2$ such that $D(\tilde{u}_1(T'_1), \tilde{u}_2(T'_2)) < M/3 - \epsilon$. The algorithm proceeds by moving along $T_1$ and $T_2$ in the direction indicated by quartet tests around the current candidate node. Lemma 6.3 shows that TreeConnection will find the correct link between $T_1$ and $T_2$, given "sufficiently close" seed nodes.

Lemma 6.3. Suppose Condition ($\star$) holds. Let $T_1$ and $T_2$ be subtrees of $T$ satisfying invariants I1 and I2, and let $P = (v_1, v_2)$ be the path joining them in $T$, with $v_i \in V(T) \cap e_i$, and $e_i \in E(T_i)$, $i = 1, 2$. Let $E_1 \subset E(T_1)$ and $E_2 \subset E(T_2)$, such that the following hold:

• $e_1 \in E_1$ and $e_2 \in E_2$

• there exist proper clades $T'_1$ and $T'_2$ of $T_1$ and $T_2$, rooted at $u_1 \in V(E_1)$ and $u_2 \in V(E_2)$ respectively, such that $D(\tilde{u}_1(T'_1), \tilde{u}_2(T'_2)) < M/3$.

Then $e_1 \in C_1$ and $e_2 \in C_2$, where $(C_1, C_2)$ = TreeConnection($T_1, T_2, E_1, E_2$). Note that $C_i$ either contains a single edge or three adjacent edges of $T_i$. If all the edges of $T$ have length at least $2\epsilon$, then $C_1 = \{e_1\}$ and $C_2 = \{e_2\}$. Furthermore, if $|C_i| = 3$, then $D(v_i, c_i) < 2\epsilon$, where $c_i$ is the center node of $C_i$.

Proof of Lemma 6.3: By Lemma 5.1,

$$D(\tilde{u}_1(T'_1), \tilde{u}_2(T'_2)) = D(\tilde{u}_1(T'_1), v_1) + D(v_1, \tilde{u}_2(T'_2)).$$

Therefore $D(v_1, \tilde{u}_2(T'_2)) < M/3$. Now suppose $e_1 = (v'_1, v''_1)$. This edge defines two clades in $T_1$, at least one of which is proper; we may assume w.l.o.g. that $v'_1$ roots a proper clade. Let $u'_1, u''_1$ be the descendants of $v'_1$ in said clade, and let $T'_1, T''_1$ be the corresponding sub-clades. Then

D{u'[{T';), U2{T^)) = D{u'[{T^), u'i) + D{ul v,) + D{v,, M^)) <P + iXo + f<f, 

thus $u_1 = v'_1, u'_1, u''_1$ will satisfy the conditions of step 1 in subroutine TreeConnection.

In turn, let $u_1, u'_1, u''_1$ satisfy the conditions of step 1 in subroutine TreeConnection. Then at least one of $u'_1, u''_1$ is not on the path $P(v_1, u_1)$. We may assume $u'_1 \notin P(v_1, u_1)$; then $u_1 \in P(v_1, u'_1)$ and thus

$$M/2 + \epsilon > D(\tilde{u}'_1(T'_1), \tilde{u}_2(T'_2)) = D(\tilde{u}'_1(T'_1), u_1) + D(u_1, \tilde{u}_2(T'_2)).$$

Thus $M/2 + \epsilon > D(u_1, \tilde{u}_2(T'_2)) = D(u_1, v_1) + D(v_1, \tilde{u}_2(T'_2))$.

Letting $a, b, c$ be as described in TreeConnection, Lemmas 5.1 and 5.4 imply that

$$D(\tilde{a}(T_a), \tilde{u}_2(T'_2)),\ D(\tilde{b}(T_b), \tilde{u}_2(T'_2)),\ D(\tilde{c}(T_c), \tilde{u}_2(T'_2)) < M/2 + \epsilon + 3\lambda_0 + \beta < M - 2\lambda_0.$$

Suppose w.l.o.g. that $(a, b \mid c, u_2)$ is the true quartet topology induced by $T$. Suppose the middle edge of $(a, b \mid c, u_2)$, namely $(u_1, v_1)$, is shorter than $\epsilon$. Since $T_1$ obeys invariant I1, this implies that $v_1$ is indeed a neighbor of $u_1$ in $T'$. Furthermore,

$$ME(D; (\tilde{a}(T_a), \tilde{c}(T_c) \mid \tilde{b}(T_b), \tilde{u}_2(T'_2))) \le ME(D; (\tilde{a}(T_a), \tilde{b}(T_b) \mid \tilde{c}(T_c), \tilde{u}_2(T'_2))) < \epsilon.$$

Similarly $ME(D; (\tilde{b}(T_b), \tilde{c}(T_c) \mid \tilde{a}(T_a), \tilde{u}_2(T'_2))) < \epsilon$. So in case TreeConnection picks the wrong direction, it will immediately exit correctly at step 2.f.
Algorithm $(E_1, E_2)$ = TreeConnection($T_1, T_2, E_1, E_2$)

INPUT: Subtrees $T_1, T_2 \in F$ and candidate sets $E_1 \subset E(T_1)$, $E_2 \subset E(T_2)$ containing the endpoints of the path joining $T_1$ and $T_2$.

OUTPUT: Refined candidate sets $E_1 \subset E(T_1)$, $E_2 \subset E(T_2)$.

(1) Let $u_1 \in V(E_1)$ and $u_2 \in V(E_2)$, and $T'_2$ be a proper clade of $T_2$ rooted at $u_2$. Let $T'_1, T''_1$ be edge disjoint proper subtrees of $T_1$ rooted at neighbors of $u_1$ in $T_1$, $u'_1$ and $u''_1$. Suppose $\hat{D}(\tilde{u}'_1(T'_1), \tilde{u}_2(T'_2)) < M/2 + \epsilon$, and $\hat{D}(\tilde{u}''_1(T''_1), \tilde{u}_2(T'_2)) < M/2 + \epsilon$. If no such $u_1, u_2$ exist, then return $\emptyset$.

(2) while $|E_1| > 1$

    (a) Let $a, b, c$ be the neighbors of $u_1$.

    (b) Let $T_a, T_b, T_c$ be edge disjoint clades of $T_1$ rooted at $a, b, c$ respectively.

    (c) if $T_a$ is not proper

        • Let $a'$ be the descendant of $a$ in $T_a$ which roots a maximal sub-clade of $T_a$ which is proper (does not contain the long edge of $T_1$). Set $a = a'$, $T_a = T_{a'}$.

    (d) Do the same as above for $b, c$.

    (e) Let $Q = FPM(\hat{D}; \tilde{a}(T_a), \tilde{b}(T_b), \tilde{c}(T_c), \tilde{u}_2(T'_2))$, with $Q = (u_2, x \mid y, z)$, $\{x, y, z\} = \{a, b, c\}$.

    (f) If $ME(\hat{D}; Q) < \epsilon$, set $E_1$ to the set of edges incident to $u_1$ and go to step (3).

    (g) Set $E_1 = E_1 \cap (E(T_x) \cup \{(u_1, x)\})$.

    (h) Set $u_1 = x$.

(3) Repeat the same process to restrict $E_2 \subset E(T_2)$.

Figure 7. Subroutine TreeConnection($T_1, T_2, E_1, E_2$) finds the edges of $T_1, T_2$ containing the endpoints of their connecting path $P$. TreeConnection will output a single edge per tree, or, in case $P$ connects too close to an existing node, the edges adjacent to that node.

Alternatively, if $D(u_1, v_1) > \epsilon$, TreeConnection will pick the correct "direction", by Lemma 5.5 and Corollary 5.6. Furthermore, if the middle edge is longer than $2\epsilon$, its estimated length will also be larger than $\epsilon$, and thus the algorithm will proceed to the next iteration of the while loop. This implies the last statement of our lemma. Also, if $v_1$ does not lie on the path $P(u_1, c)$, then $D(u_1, v_1) > D(u_1, c) > 2\epsilon$, and thus the algorithm will proceed to the next iteration of the while loop.

To complete the argument, either $v_1 \in P(u_1, c)$ or $c \in P(v_1, u_1)$. In the first case

$$D(c, \tilde{u}_2(T'_2)) < D(u_1, \tilde{u}_2(T'_2)) + 2\lambda_0 < M/2 + \epsilon + 2\lambda_0,$$

and the procedure will terminate at the next iteration of the while loop. In the latter case,

$$D(c, \tilde{u}_2(T'_2)) < D(u_1, \tilde{u}_2(T'_2)) < M/2 + \epsilon,$$

and we can proceed by induction on $|E_1|$ to show that at every step TreeConnection picks the correct direction or exits correctly. □

Subroutine $d$ = TreeDistance($T_1, T_2, E_1, E_2$)

INPUT: Subtrees $T_1, T_2 \in F$ and candidate sets $E_1 \subset E(T_1)$, $E_2 \subset E(T_2)$ containing the endpoints of the path $P$ joining $T_1$ and $T_2$.
OUTPUT: Estimated length of $P$.

(1) if $|E_1| = |E_2| = 1$

    • Let $Q = (a, b \mid c, d)$ with $E_1 = \{(a, b)\}$, $E_2 = \{(c, d)\}$. Let $e$ be the middle edge of $Q$.

    • Let $T'$ be the tree given by joining $T_1$ and $T_2$ according to $Q$.

    • return EdgeLength($e$, $T'$).

(2) else

    • Let $E_1 = \{(v, v_1), (v, v_2), (v, v_3)\}$.

    • return $\min\{$TreeDistance($T_1(v, (v, v_i)), T_2, \{(v_j, v_k)\}, E_2$), $\{i, j, k\} = \{1, 2, 3\}\}$

Figure 8. Subroutine TreeDistance($T_1, T_2, E_1, E_2$) estimates the length of the path connecting $T_1, T_2$, based on the set of possible connection edges $E_1, E_2$, output by TreeConnection($T_1, T_2$).

Lemma 6.4. Assume all the hypotheses and notation of Lemma 6.3 hold. Let $(C_1, C_2)$ = TreeConnection($T_1, T_2, E_1, E_2$). Then $|$TreeDistance($T_1, T_2, C_1, C_2$) $- L(P)| < \epsilon$.

Proof of Lemma 6.4: All the quartets inspected by TreeDistance were previously inspected by TreeConnection as well. The proof of Lemma 6.3 shows that the true diameters of all said quartets are less than $M - \epsilon$. Thus ($\star$) and Corollary 5.6 imply that all path lengths returned by EdgeLength will be $\epsilon$-approximations of the true path lengths. Furthermore, it is an easy exercise to see that TreeDistance returns $L(P)$ in the event that $\hat{D} = D$. The conclusion follows trivially. □



7. Correctness, stopping criteria and running time analysis 

In this section we prove the three main theorems of the paper, which were stated in Section 4.

Theorem 4.4. If ($\star$) holds, algorithm TREE-MERGE returns a topologically correct sub-forest $F$ of $T$ satisfying invariants I1, I2, I3.

Proof of Theorem 4.4: We proceed inductively to show that invariants I1, I2, I3 are always obeyed by TREE-MERGE, and additionally that:

C1': edge lengths computed by EdgeLength are correct within $\epsilon$.

C2': all connections computed by TreeConnection are correct.

C3': tree distances contained or previously contained in TreeDistQueue are correct within $\epsilon$.

All of the above hold trivially under ($\star$) in the case of $F = X$, which is our base case. By Lemma 6.2, C1' will hold inductively. Then steps 4.c, 4.e and 4.f, together with Lemma 6.3, show that C1 and C2 are never violated. In turn, C1, C2 and C1' imply that invariants I1, I2 hold for the next iteration of the algorithm.

Lemma 6.3 shows that the TreeConnection subroutine returns a topologically correct way of connecting two components of $F$, as long as I1 and I2 hold and the candidate edge sets $E_1$ and $E_2$ contain the endpoints of the correct linking path. It is an easy argument to show that if the previously computed connections were correct, then UpdateTreeDistQueue calls TreeConnection with appropriate sets of candidate edges. This establishes condition C2'. Lemma 6.4 together with condition C2' thus imply condition C3'.

To complete the proof of the theorem, it remains to show that invariant I3 is obeyed, namely the components of $F$ are disjoint in $T$. Suppose by contradiction that $T_1, T_2$ are the first pair of subtrees that are joined such that the path $P$ linking them overlaps subtree $T_3 \in F$. Suppose $l$ is the true length and $l'$ the estimated length of the path $P$, computed by TreeDistance($T_1, T_2$). Let the distance between $T_1$ and $T_3$ be $l_1$, the distance between $T_2$ and $T_3$ be $l_2$, and let $l'_1$ and $l'_2$ be the corresponding estimated distances. Since $T_1, T_2$ were joined, $3\epsilon < l' < 2\lambda_0 - 5\epsilon$. Therefore $l < l' + \epsilon < 2\lambda_0 - 4\epsilon$. Then $l_1 + l_2 < l < 2\lambda_0$, which implies that the two estimated distances, $l'_1$ and $l'_2$, were computed at a previous step of TREE-MERGE. Then

$$l'_1 + l'_2 - 2\epsilon < l_1 + l_2 < l < l' + \epsilon, \quad \text{so that} \quad l'_1 + l'_2 < l' + 3\epsilon,$$

which contradicts step 4.g of TREE-MERGE. □

Theorem 4.5. Let $T$ satisfy $6\epsilon < L(e) < \lambda_0 - 3\epsilon$, $\forall e \in E(T)$. Then given $N$ independent samples $x_1, \ldots, x_N$ from the character distribution $P_{T,L}$, $T$ will be fully and correctly recovered by TREE-MERGE with probability at least $1 - \xi$.

Proof of Theorem 4.5: As before, ($\star$) holds with probability $1 - \xi$. Theorem 4.4 shows that the output of the algorithm is topologically correct. It remains to prove that, under the additional hypotheses of the present theorem, TREE-MERGE will not terminate before the full topology is resolved.

Let us suppose that TREE-MERGE outputs a forest $F$ with more than one component. Let $T_F$ be the tree given by collapsing every connected component of $F$ into a single node. All edges of $T_F$ correspond to single edges of $T$. Since all edges in $T$ are longer than $6\epsilon$, Lemma 6.3 shows that TreeConnection will always output well-defined connections (i.e. candidate edge sets of cardinality 1), and moreover all internal edge estimates will be longer than $5\epsilon$. Thus condition C1 is never violated and TREE-MERGE will never reject a candidate pair at steps 4.c or 4.e.

Let $T_1$ and $T_2$ form a cherry of $T_F$. Suppose the common neighbor of $T_1$ and $T_2$ in $T_F$ does not correspond to another component of $F$. The length of the path $P$ joining $T_1$ and $T_2$ will be less than $2\lambda_0 - 6\epsilon$ and therefore the pair $T_1, T_2$ was inserted into TreeDistQueue. Letting $T_{new}$ be the tree formed by joining $T_1$ and $T_2$, we see that all edges in $T_{new}$ other than the one corresponding to $P$ correspond to single edges of $T$, and have lengths between $6\epsilon$ and $\lambda_0 - 3\epsilon$. Thus $T_{new}$ will satisfy condition C2 under ($\star$), and will not get rejected at step 4.f of TREE-MERGE.

Figure 9. The connectivity graph induced by $T$ on $F$, with configurations of pairs of subtrees which can be joined by TREE-MERGE.

Alternatively, suppose the common neighbor of $T_1$ and $T_2$ in $T_F$ corresponds to $T_3 \in F$. Then $T_3$ contains at most one "long" edge, by C2. Thus the tree obtained by joining $T_3$ to $T_1$ will also contain at most one long edge, as all edges of the new tree which are not also edges of $T_3$ correspond to single edges of $T$, and thus are "short". Thus, again, the pair $T_1, T_3$ will not be rejected at step 4.f.

The only remaining possibility is that the candidate pair $T_1, T_2$ gets rejected at step 4.g. From our selection of the candidate pair, we can see that the joined tree $T_{new}$ is in fact edge disjoint from all other trees in $F$. Let $T_k$ be the tree causing the rejection at step 4.g. We let $l, l_1, l_2$ be the lengths of the paths joining $T_1, T_2$; $T_1, T_k$; and $T_2, T_k$, and $l', l'_1, l'_2$ be the corresponding estimated tree distances. Since $T_{new}$ and $T_k$ are edge disjoint and all edges of $T$ have length at least $6\epsilon$, it is a simple argument to show that $l_1 + l_2 > l + 6\epsilon$, and thus $l'_1 + l'_2 < l' + 3\epsilon$ cannot occur under ($\star$). □
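
The last step can be spelled out as follows, assuming (as in the proof of Theorem 4.4) that under ($\star$) every estimated tree distance is within $\epsilon$ of the true one, and taking the bound $l_1 + l_2 > l + 6\epsilon$ from the text as given:

```latex
\begin{align*}
l'_1 + l'_2 &> (l_1 - \epsilon) + (l_2 - \epsilon) \\
            &> (l + 6\epsilon) - 2\epsilon = l + 4\epsilon \\
            &> (l' - \epsilon) + 4\epsilon = l' + 3\epsilon .
\end{align*}
```

Hence the rejection test of step 4(g), which requires $l' + 3\epsilon > l'_1 + l'_2$, fails for every $T_k$.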

Theorem 4.6. TREE-MERGE always terminates in $O(Nn^2 + n^3)$ time, where the proportionality constant is a decreasing function of $\xi$ and $\epsilon$.

Proof of Theorem 4.6: Steps 1-3 of TREE-MERGE trivially take $O(n^2 N + n^2 \log(n))$ time. Every iteration of Step 4, the main loop of the algorithm, either reduces the size of the forest $F$ by one, or determines that a pair of trees in $F$ cannot be merged. Each time the forest gets modified, a single new tree is produced. Since the forest is modified at most $n - 1$ times, throughout the life of the algorithm there are at most $O(n^2)$ tree pairs being inserted/popped from TreeDistQueue. Thus there are at most $O(n^2)$ iterations of step 4. In particular, the total time spent in steps 4.a-d is $O(n^2 \log(n))$.






RADU MIHAESCU, CAMERON HILL, AND SATISH RAO 



In order to verify C1 and C2 on $T_{new}$ one only needs to compute a fixed number of edge lengths in addition to the ones already in the forest, so again, steps 4.e-f take $O(n^2)$ time. The verification at step 4.g takes linear time per iteration, so at most $O(n^3)$ time.

Step 4.h is equivalent to a modification of the forest, so it only occurs at most $n - 1$ times. Learning the sequences of new proper clades can be done in a bottom-up fashion, such that each new root can be computed through a single recursive majority step. Thus step 4.h.(ii) takes time proportional to the number of new learned sequences, and its contribution to the total running time is $O(nN)$. Similarly step 4.h.(iii) will contribute at most $O(n^2 N + n^2 \log(n))$.

Now suppose that $|V(T_{new})| = t$ and $|F| = s$. The subroutine TreeConnection runs in time at most linear in the sizes of its input subtrees. Thus one iteration of UpdateTreeDistQueue will spend $O(st + (n - t))$ time for building the new tree connections, and $O(s \log(n))$ time in the insertion and deletion of tree pairs from TreeDistQueue. Summing over all iterations, the total time spent in UpdateTreeDistQueue is $O(n^3)$. □



8. Final remarks 

A simple amortized argument shows that step 4.h, as detailed here, only takes $O(n^2(N + \log(n)))$ time. We do not include this argument for the sake of brevity. Thus the verification of step 4.g is the true running-time bottleneck of the algorithm. In practice, this verification should be rendered somewhat redundant by the fact that at every step we join the pair of trees which are closest, but sadly this is not sufficient for a formal proof of correctness.

Our methods here are general enough to specify an all-purpose phylogeny reconstruction
algorithm. The bounds we require on edge lengths can in fact be relaxed at the cost of 
longer sequence lengths. If one is able to estimate the lengths of the already-constructed 
edges, then one can also estimate the expected disagreement between the real and learned 
sequences at internal nodes. Similarly, with given sequence lengths one can infer a 
"robust" area of the space given by M and e: for every estimated distance we can infer 
an e that is larger than the estimation error with high probability. If the phylogeny can 
be progressively disambiguated from the available distances, given their expected errors, 
then we have achieved our purpose. 

Indeed, these ideas are at the base of the methods in [19]. The availability of Mossel's techniques for inferring ancestral sequences simply gives us a very powerful tool for "reaching deeper" into the phylogenetic tree and improving on classical distance methods without departing too much from their simplicity.

Appendix A. The triangle inequality for the CFN model

The main purpose of this Appendix is to provide a proof of Lemma 5.4.

Lemma 5.4. Let $T'$ be an induced subtree of $T$ rooted at $\rho$ and let $v \in (V(T) \cap T') \setminus \delta(T')$. Then

$$D(\tilde{\rho}(T'), v) \le D(\rho, v) + D(\rho, \tilde{\rho}(T')),$$

with $\tilde{\rho}(T')$ a "learned" character value, where the learning occurs by any bottom-up recursive majority algorithm on $T'$, as outlined earlier in the paper.

Figure 10. Tree configuration for the proof of Lemma 5.4.


We begin by introducing an alternative representation of the CFN model under a percolation framework. This intuitive view lies at the root of the theoretical results regarding information flow on trees in [16] and [17].

Let $p(e) < 0.5$ be the probabilities of mutation along edges $e \in E(T)$ for a CFN model on $T$. Let $\alpha(e)$ be independent random variables such that

$$\alpha(e) = \begin{cases} 1 & \text{with probability } 1 - 2p(e), \\ 0 & \text{with probability } 2p(e). \end{cases}$$

Suppose each edge $e$ in $T$ carries a survival probability $\theta(e) = 1 - 2p(e)$, such that the edge $e$ is deleted if $\alpha(e) = 0$. After removing the destroyed edges, each surviving connected component $C$ receives a single character value $x(C)$, by tossing an independent unbiased coin. We write $u \leftrightarrow v$ for the event that the two nodes $u, v$ are in the same connected component and $C_v$ for the component containing $v$.

It is easy to see that the joint probability distribution on character values at $V(T)$ produced under this alternative model, $P_{T,\theta}$, is the same as the one induced by the original CFN model: $P_{T,L}$, where $L(e) = -\log(1 - 2p(e))/2 = -\log(\theta(e))/2$. As before, $D$ is the distance between uniform binary random variables defined in Section 2.
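
The equivalence of the two descriptions can be checked numerically on a single path. The sketch below is an illustration rather than part of the proof: it enumerates all flip patterns (CFN view) and all survival patterns (percolation view) and compares the resulting endpoint correlations $E[x(u)x(v)]$. The function names are hypothetical.

```python
from itertools import product

def cfn_correlation(ps):
    """E[x(u) x(v)] under CFN along a path with mutation probabilities ps."""
    total = 0.0
    for flips in product((0, 1), repeat=len(ps)):
        prob = 1.0
        for f, p in zip(flips, ps):
            prob *= p if f else 1.0 - p
        total += prob * (-1) ** sum(flips)  # sign flips once per mutation
    return total

def percolation_correlation(ps):
    """E[x(u) x(v)] under the percolation view of the same path.

    The endpoints share one coin if every edge survives; otherwise they
    receive independent fair coins, contributing 0 to the correlation.
    """
    total = 0.0
    for alive in product((0, 1), repeat=len(ps)):
        prob = 1.0
        for a, p in zip(alive, ps):
            theta = 1.0 - 2.0 * p  # survival probability of the edge
            prob *= theta if a else 1.0 - theta
        total += prob * (1.0 if all(alive) else 0.0)
    return total
```

Both quantities equal the product of the survival probabilities $\theta(e)$ along the path, which is why the path length $\sum_e L(e)$ is recovered as $-\frac{1}{2}\log$ of the correlation.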

Proof or Lemma 15. 4t Let p denote p{T') for brevity. The lemma is equivalent to 
E[t'p] > E[pf]E[pp]. It is an easy exercise to show that, in turn, this is equivalent to 
^[p = p\p = v\> P[p = pIp 7^ v\. By symmetry, we may assume for the rest of the proof 
that p = 1, so x(Cp) = 1, and our task reduces to showing that 

P[p = l|p = v = l]> P[p = l|p = 1 ^ 

Let E{P) = {ei, . . . ,6^} denote the edges of the path P = P{p,v), and let V{P) = 
{vi, . . . ,Vs = v} he the nodes of P, other than p. Then l(p u) = l{a{E{P)) = 1). 
We proceed by way of a standard coupling argument. 

Suppose a{E{T)) is such that such that p ^ v. Given a set of values Xo for the 
characters C ^ C^,C ^ Cp, 

P[X{C,) = l,x{C^.,p) = Xo\a] = P[x(a,) = -l,x(C^.,p) = XoH 

Now $\hat\rho$ is a recursive majority function in the character values at $\delta T'$, and is therefore coordinate-wise increasing in the values of those characters. Moreover, $\chi(\delta T')$ in the event $\mathbb{1}[\chi(C_v) = 1, \chi(C_\rho) = 1, \chi(C_{\neq v,\rho}) = \chi_0]$ is coordinate-wise larger than $\chi(\delta T')$ in the event $\mathbb{1}[\chi(C_v) = -1, \chi(C_\rho) = 1, \chi(C_{\neq v,\rho}) = \chi_0]$, while the probabilities of the two events, conditioned on the values $\alpha$, are the same. Summing over all values $\chi_0$ and all values $\alpha$ such that $\rho \nleftrightarrow v$,

$$P[\hat\rho = 1 \mid \rho \nleftrightarrow v, v = 1] \geq P[\hat\rho = 1 \mid \rho \nleftrightarrow v, v = -1] = P[\hat\rho = 1 \mid v = -1]. \qquad (7)$$
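The monotonicity property used in this step can be illustrated by brute force. The sketch below is only an illustration under stated assumptions: it takes majorities over groups of three leaves so that ties never occur, which need not match the tie-breaking rule of the recursive majority defined earlier in the paper, and the function names are hypothetical.

```python
from itertools import product

def recursive_majority(leaves, arity=3):
    """Recursive majority of +/-1 values; len(leaves) must be a power of
    `arity`. With arity 3 the sum of a group is odd, so there are no ties."""
    level = list(leaves)
    while len(level) > 1:
        level = [1 if sum(level[i:i + arity]) > 0 else -1
                 for i in range(0, len(level), arity)]
    return level[0]

def is_coordinatewise_increasing(f, n):
    """Check that raising any single coordinate from -1 to +1 never lowers f.
    Single-coordinate checks suffice, since any coordinate-wise larger
    input is reached by a sequence of such flips."""
    for x in product((-1, 1), repeat=n):
        for i in range(n):
            if x[i] == -1:
                y = x[:i] + (1,) + x[i + 1:]
                if f(y) < f(x):
                    return False
    return True

monotone = is_coordinatewise_increasing(recursive_majority, 9)
print(monotone)
```

Any deterministic rule built by composing monotone local majorities is coordinate-wise increasing, which is the only property of $\hat\rho$ that the coupling argument uses.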

For any $x \in \{\pm 1\}^s$ and any $b \in \{\pm 1\}^t$ with $t = |E(T)| - s$, an argument identical to the one above shows that

$$P[\hat\rho = 1 \mid \alpha(E(T) \setminus E(P)) = b, \chi(V(P)) = 1] \geq P[\hat\rho = 1 \mid \alpha(E(T) \setminus E(P)) = b, \chi(V(P)) = x].$$

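We observe that $\mathbb{1}[\rho \leftrightarrow v] = \mathbb{1}[\alpha(E(P)) = 1]$ implies $\mathbb{1}[\chi(V(P)) = 1]$ and that $\alpha(E(T) \setminus E(P))$ and $\alpha(E(P))$ are independent, thus $\alpha(E(T) \setminus E(P))$ and $\chi(V(P))$ are independent. Therefore

$$\begin{aligned}
& P[\hat\rho = 1 \mid \alpha(E(T) \setminus E(P)) = b, v \leftrightarrow \rho] \geq P[\hat\rho = 1 \mid \alpha(E(T) \setminus E(P)) = b, \chi(V(P)) = x], \quad \forall x \\
\Rightarrow\ & P[\hat\rho = 1 \mid v \leftrightarrow \rho] \geq P[\hat\rho = 1 \mid \chi(V(P)) = x], \quad \forall x \\
\Rightarrow\ & P[\hat\rho = 1 \mid v \leftrightarrow \rho] \geq P[\hat\rho = 1 \mid \alpha(E(P)) = a, v = 1], \quad \forall a \in \{\pm 1\}^s,\ a \neq 1 \\
\Rightarrow\ & P[\hat\rho = 1 \mid v \leftrightarrow \rho] \geq P[\hat\rho = 1 \mid v \nleftrightarrow \rho, v = 1].
\end{aligned}$$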

The first implication follows from summation over all values of $b$. The second comes from summation over all values of $x$ such that $x_s = v = 1$ and $x$ is compatible with $\alpha(E(P)) = a$. The third implication follows from summing over all values $a \in \{\pm 1\}^s$, $a \neq 1$. The last inequality, together with (7), implies

$$P[\hat\rho = 1 \mid v = 1] \geq P[\hat\rho = 1 \mid v \nleftrightarrow \rho, v = 1] \geq P[\hat\rho = 1 \mid v \nleftrightarrow \rho, v = -1] = P[\hat\rho = 1 \mid v = -1].$$



□ 
Appendix B. Applicability to other molecular models of evolution

Our method implies similar results for all group-based models of evolution in which the character alphabet is a group $G$ admitting a non-trivial morphism $\phi : G \to \mathbb{Z}_2$. In a group-based model of evolution, the probability of transformation of the character $\chi$ from state $a$ to state $b$ along any edge $e$ of the tree depends only on $a^{-1}b$. In other words, for an edge $e = (u, v) \in E(T)$,

$$P(\chi(v) = b \mid \chi(u) = a) = p_e(a^{-1}b).$$

By the definition of a morphism,

$$\phi(a) \neq \phi(b) \iff \phi(a^{-1}b) = -1.$$

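Thus

$$P[\phi(\chi(u)) \neq \phi(\chi(v)) \mid \chi(u) = a] = P[\phi(\chi(u)) \neq \phi(\chi(v)) \mid \phi(\chi(u)) = \phi(a)] = \sum_{g \in \phi^{-1}(-1)} p_e(g),$$

which does not depend on $a$ and implicitly does not depend on $\phi(a)$.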

We can then reduce any such model to the binary one by identifying a state $g \in G$ with $\phi(g) \in \mathbb{Z}_2$ and applying our analysis mutatis mutandis. The most notable example of a group-based model of evolution satisfying our requirements is the Kimura 3ST model [15], which is realized by the group $\mathbb{Z}_2 \times \mathbb{Z}_2$ [20]. We also note that Kimura 3ST is a generalization of the well-known Jukes-Cantor model.
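The reduction can be illustrated numerically for $\mathbb{Z}_2 \times \mathbb{Z}_2$. The following is a minimal sketch: the choice of morphism (projection onto the first coordinate), the function names and the edge distribution `p_e` are hypothetical values chosen for illustration, not taken from the paper.

```python
from itertools import product

# Elements of Z2 x Z2, written additively; every element is its own inverse.
G = list(product((0, 1), repeat=2))

def add(a, b):
    """Group operation of Z2 x Z2."""
    return ((a[0] + b[0]) % 2, (a[1] + b[1]) % 2)

def phi(g):
    """An assumed non-trivial morphism G -> {+1, -1}: sign of the first coordinate."""
    return -1 if g[0] == 1 else 1

def projected_flip_probability(p_e, a):
    """P[phi(chi(v)) != phi(chi(u)) | chi(u) = a], where the transition
    probability from a to b is p_e[a^{-1} b]."""
    total = 0.0
    for b in G:
        if phi(b) != phi(a):
            total += p_e[add(a, b)]   # a^{-1} b equals a + b in Z2 x Z2
    return total

# Hypothetical edge distribution with three distinct substitution probabilities.
p_e = {(0, 0): 0.70, (0, 1): 0.15, (1, 0): 0.10, (1, 1): 0.05}

flip_by_state = {a: projected_flip_probability(p_e, a) for a in G}
kernel_complement_mass = sum(p_e[g] for g in G if phi(g) == -1)
print(flip_by_state, kernel_complement_mass)
```

For every starting state the projected flip probability should equal the total mass of $\phi^{-1}(-1)$, so the projected process behaves like a binary symmetric model with that flip probability on the edge.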